Add health check to watch.stream for silent connection drops #2525

Open
Urvashi0109 wants to merge 1 commit into kubernetes-client:master from Urvashi0109:Fix-Watch-Health-Check
Conversation

@Urvashi0109
Contributor

What type of PR is this?

/kind bug
/kind feature

What this PR does / why we need it:

When a watch is running on Kubernetes objects (e.g., Jobs, Pods, Namespaces) and the Kubernetes control plane is upgraded, the watch connection is silently dropped. The watcher hangs indefinitely: no exception is raised and no new events are received. This happens because the TCP connection enters a half-open state in which the client believes the connection is still alive, but the server side has been torn down during the upgrade.
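
As a rough illustration of the hang described above, here is a stdlib-only sketch (not code from this PR). A local listening socket that never sends data stands in for the silent peer; the address, the 0.2-second timeout, and the variable names are assumptions made for the example. It shows a blocking read that only returns control because a read timeout is set.

```python
import socket

# Listening socket that never accepts or sends anything. The kernel still
# completes the TCP handshake, so the client sees a "live" connection.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
server.listen(1)

client = socket.create_connection(server.getsockname(), timeout=5)
client.settimeout(0.2)  # read timeout, playing the role of a health check interval

try:
    client.recv(1024)  # with no timeout set, this call would block indefinitely
    timed_out = False
except socket.timeout:
    timed_out = True  # the timeout is the only signal that the peer is silent
finally:
    client.close()
    server.close()

print("read timed out:", timed_out)
```

Without the `settimeout` call, `recv` would simply never return, which matches the behavior reported in the linked issue.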

This PR adds a _health_check_interval parameter to watch.stream() that detects silent connection drops and automatically reconnects:

  • When _health_check_interval is set to a value > 0, a socket-level read timeout (_request_timeout) is configured on the HTTP connection
  • If no data arrives within the specified interval, urllib3 raises a ReadTimeoutError
  • The watch catches this exception and automatically reconnects using the last known resource_version, ensuring no events are missed
  • The feature is disabled by default (_health_check_interval=0), preserving full backward compatibility
  • When disabled, ReadTimeoutError propagates to the caller as before
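
The behavior listed above can be sketched roughly as follows. This is a self-contained simulation, not the code in this PR: `ReadTimeout` is a stand-in for urllib3's `ReadTimeoutError`, and `watch_with_health_check`, `open_stream`, `max_reconnects`, and `fake_stream` are hypothetical names introduced only for the example.

```python
class ReadTimeout(Exception):
    """Stand-in for urllib3's ReadTimeoutError (assumption, keeps the sketch stdlib-only)."""


def watch_with_health_check(open_stream, health_check_interval=0, max_reconnects=3):
    """Yield (resource_version, event) pairs, reconnecting after read timeouts.

    open_stream is a hypothetical callable that takes the last seen
    resource_version (or None) and returns an iterator of pairs.
    max_reconnects is an illustrative safety bound, not part of the PR.
    """
    resource_version = None
    reconnects = 0
    while True:
        try:
            for version, event in open_stream(resource_version):
                resource_version = version  # remember where to resume from
                yield version, event
            return  # stream ended normally
        except ReadTimeout:
            if health_check_interval <= 0:
                raise  # feature disabled: propagate to the caller as before
            reconnects += 1
            if reconnects > max_reconnects:
                raise
            # otherwise loop and reopen the stream from resource_version


calls = []


def fake_stream(resource_version):
    """First connection goes silent after two events; the second one resumes."""
    calls.append(resource_version)
    if resource_version is None:
        yield "1", "ADDED"
        yield "2", "MODIFIED"
        raise ReadTimeout("no data within interval")
    yield "3", "DELETED"


events = list(watch_with_health_check(fake_stream, health_check_interval=30))
print(events)
print(calls)
```

In this simulation the second connection is opened with the last version seen before the timeout, so the consumer sees one continuous sequence of events.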

Which issue(s) this PR fixes:

Fixes #2462

Special notes for your reviewer:

  • This PR takes the following approach: it leverages urllib3's existing read timeout mechanism (_request_timeout) to break out of the blocking read, then catches the resulting ReadTimeoutError/ProtocolError exceptions
  • The _ prefix on _health_check_interval follows the existing convention in this codebase (e.g., _preload_content, _request_timeout) for parameters that are consumed by the client library rather than passed to the API server
  • 5 new unit tests added; all 24 tests (19 existing + 5 new) pass with no regressions

Does this PR introduce a user-facing change?

Added `_health_check_interval` parameter to `watch.stream()` to detect and recover from silent connection drops during Kubernetes control plane upgrades. When set to a value > 0 (seconds), the watch will automatically reconnect if no data is received within the specified interval. Disabled by default for backward compatibility.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. labels Mar 18, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Urvashi0109
Once this PR has been reviewed and has the lgtm label, please assign yliaog for approval. For more information see the Code Review Process.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from fabianvf and yliaog March 18, 2026 11:42
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 18, 2026
@Urvashi0109 Urvashi0109 marked this pull request as ready for review March 18, 2026 11:43
Copilot AI review requested due to automatic review settings March 18, 2026 11:43
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 18, 2026
@k8s-ci-robot k8s-ci-robot requested a review from roycaihw March 18, 2026 11:43
Copilot AI left a comment

Pull request overview

Adds an optional “health check” mechanism to Watch.stream() intended to detect silent watch connection drops (e.g., during control plane upgrades) by configuring timeouts and retrying from the last observed resource_version.

Changes:

  • Introduces _health_check_interval parameter in Watch.stream() and handles ReadTimeoutError/ProtocolError to trigger reconnects.
  • Auto-populates _request_timeout from _health_check_interval when not explicitly provided.
  • Adds unit tests covering reconnect behavior, default behavior, timeout propagation, and request-timeout argument handling.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
kubernetes/base/watch/watch.py Adds _health_check_interval handling, sets timeouts, and retries on read/connection errors.
kubernetes/base/watch/watch_test.py Adds tests validating reconnect + timeout configuration/compatibility.

Comment on lines +256 to +260
    # If health check is enabled, treat a read timeout as a
    # silent connection drop and allow the outer while loop
    # to reconnect using the last known resource_version.
    if health_check_interval > 0:
        pass  # Fall through to retry logic below

Comment on lines +255 to +262
    except (ReadTimeoutError, ProtocolError) as e:
        # If health check is enabled, treat a read timeout as a
        # silent connection drop and allow the outer while loop
        # to reconnect using the last known resource_version.
        if health_check_interval > 0:
            pass  # Fall through to retry logic below
        else:
            raise

    # Verify _request_timeout was set to the health check interval
    fake_api.get_namespaces.assert_called_once_with(
        _preload_content=False, watch=True,
        timeout_seconds=10, _request_timeout=30)

Comment on lines +772 to +775
    # Verify the user's _request_timeout (60) was preserved, not overridden
    fake_api.get_namespaces.assert_called_once_with(
        _preload_content=False, watch=True,
        timeout_seconds=10, _request_timeout=60)

Comment on lines +216 to +217
    if health_check_interval > 0 and '_request_timeout' not in kwargs:
        kwargs['_request_timeout'] = health_check_interval
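
The defaulting rule in the last snippet, and the two assertions quoted before it, can be reproduced in isolation. The sketch below is not the PR's code: `apply_health_check_timeout` is a hypothetical helper name, and it copies the dict rather than mutating it purely to keep the example free of side effects.

```python
def apply_health_check_timeout(kwargs, health_check_interval):
    """Return a copy of kwargs with _request_timeout defaulted from the interval.

    Hypothetical helper for illustration; an explicit _request_timeout
    supplied by the caller always wins.
    """
    result = dict(kwargs)
    if health_check_interval > 0 and '_request_timeout' not in result:
        result['_request_timeout'] = health_check_interval
    return result


# Interval set, no explicit timeout: the interval becomes the read timeout.
defaulted = apply_health_check_timeout({'timeout_seconds': 10}, 30)

# Explicit timeout supplied by the caller: it is preserved.
preserved = apply_health_check_timeout(
    {'timeout_seconds': 10, '_request_timeout': 60}, 30)

# Feature disabled (interval 0): kwargs are left unchanged.
disabled = apply_health_check_timeout({'timeout_seconds': 10}, 0)

print(defaulted, preserved, disabled)
```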
Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/feature Categorizes issue or PR as related to a new feature. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fail watch gracefully on control plane upgrade

3 participants